Abstract

We discuss a model of protein conformations in which the conformations are combinations
of short fragments from some small set. For these fragments we consider the distribution of
frequencies of occurrence of pairs (sequence of amino acids, conformation), averaged over
some balls in the spaces of sequences and conformations. These frequencies can be estimated
because the $\varepsilon$-entropy of the set of conformations of protein fragments is small.

We consider statistical potentials for protein fragments which describe these
frequencies of occurrence, and discuss a model of the free energy of a protein in which the free
energy is equal to a sum of statistical potentials of the fragments.

We also consider the contribution of contacts of fragments to the energy of a protein
conformation, and the contribution from statistical potentials of a hierarchical set of larger
protein fragments. This set of fragments is constructed using the distribution of frequencies of
occurrence of short fragments.

We discuss applications of this model to the problem of predicting the native conformation
of a protein from its primary structure and to the description of the dynamics of a protein. A
modification of structural alignment that takes into account statistical potentials for protein
fragments is considered, and its application to the threading procedure for proteins is discussed.


1 Introduction 

In the present paper we consider a model of protein free energy based on bioinformatics. We
construct a model of statistical (or empirical) potentials for short fragments of proteins. The
application of this model to the problem of predicting the native conformation of a protein from
its primary structure, and to the investigation of protein dynamics, is discussed.

The main approach in physics is the construction of physical models starting from first
principles (fundamental interactions). This approach is effective for systems with low complexity
(the complexity can be understood as Kolmogorov complexity), but is less effective for complex
systems, in particular in biology.

Application of computations from first principles to the modeling of proteins results in a large
amount of calculation. Moreover, the computed dynamics will be unstable: small perturbations
of the parameters of the model will result in large changes in behavior, and not only the dynamics
but also the lowest energy state may change. Actually, this property of proteins, to have complex
behavior which can be adjusted by changing their primary structures (sequences of amino acid
residues in proteins), is crucially important for biology.

For complex systems we discuss the following approach — instead of description from first 
principles one can consider a model which describes effective behavior of the system using a 
knowledge base about the system. The system will be described by some large database. Our aim 
is to construct a big data physical model for conformations and dynamics of proteins. 

Models of complex and disordered systems have been widely investigated in the theory of spin glasses
[1]. In this case the idea of genericity was used and the details of the disorder in the system (in 
particular the disorder matrix for the Sherrington-Kirkpatrick model) were not important. In this 
paper we will discuss how to take into account the particular properties of the disorder represented 
by some database. 

For the investigation of proteins we will use a combination of models of fragments and statistical 
potentials. In models of statistical (empirical) potentials effective interactions in the system are 
reproduced using the empirical distribution functions, see 0 . 0 - This approach is widely used 
in physics of proteins. For more discussion of physics of polymers see m, m- 

In the approach of fragments, protein conformations are represented as combinations of short
fragments (for instance, fragments of length five amino acid residues) [5], [6], [7], [8], [9], [10],
[11], [12], [13], [14], [15]. It was shown that, with respect to some natural metrics, the
conformations of fragments can be clustered into a small set of clusters of small radii. In the
present paper we discuss this phenomenon as smallness of the $\varepsilon$-entropy of the space of
conformations of fragments. A lattice version of the model of fragments was discussed in [16]; it
was shown that in this model one obtains lattice models of protein secondary structures.

We construct a statistical potential for pairs (sequence of amino acids, conformation) for short
fragments of proteins using averaging over some balls in the space of sequences. This statistical
potential can be used for modeling protein conformations. One can take into account the
contacts of protein fragments in a similar way. The resulting model of protein free energy will be a
model with non-local cooperative interaction based on statistical potentials for protein fragments
(an example of a big data physical model).

The next component of the model is the hierarchical analysis of protein structure (a
generalization of the approach of [13], [14], [15]). We consider the distribution of frequencies of
occurrence of short protein fragments as a function of the number (position) of a fragment in the
protein, and investigate the hierarchy of local maxima and minima of this function. We consider
the contribution to the free energy of a protein from statistical potentials of this hierarchical
set of structures. The hierarchical structure of polymer globules was also discussed for DNA
packing [17].

The free energy functional for this model will depend on approximately thousands of parameters,
whereas most complex physical models have approximately dozens of parameters. This shows the
difference in the degree of complexity between physical systems and minimally complex biological
systems.

The exposition of the present paper is as follows. In section 2 we consider the model of
fragments and in section 3 the construction of statistical potentials which describe the joint
distribution of pairs (sequence, conformation) for fragments of proteins. The construction of
statistical potentials is based on the smallness of the $\varepsilon$-entropy of the space of
conformations of fragments. In section 4 we construct statistical potentials for contacts of
fragments in an analogous way. In section 5 we discuss hierarchical systems of larger fragments
and statistical potentials for these systems. In section 6 we discuss a model of combinatorial
optimization for the construction of the native conformation of a protein from its primary
structure, and consider a modification of the threading method using statistical potentials of
fragments. In section 7 we discuss the structural alignment construction related to the model of
fragments and statistical potentials discussed in this paper. In section 8 we discuss the
application to protein dynamics. Section 9 contains the conclusion of the paper.


2 Space of fragments 

We consider the problem of correspondence between the primary structure and the native
conformation of a protein. We use a variant of the model of fragments of proteins (models of
fragments were considered in many works, in particular in [5], [6], [7], [8], [9], [10], [11], [12]).
We consider a database $S$, a set of fragments of proteins. All fragments in $S$ have a fixed short
length $N$ (for instance one can take $N$ equal to five amino acid residues). The database $S$
under consideration will contain pairs $(I, \Gamma)$ of the form (sequence of amino acids in the
fragment, conformation of the fragment). We will describe the function $P(I, \Gamma)$ of the joint
distribution of sequences and conformations of fragments using the database $S$. Then, using
this function, we will discuss native conformations of proteins.

Let us denote by $X$ and $Y$ respectively the set of all sequences of amino acid residues in
fragments and the set of all conformations of fragments (not necessarily in $S$). Let us consider
several metrics on these sets.

Metric on the set of sequences of amino acids in fragments. Let us consider the metric
on the space $X$ of sequences of amino acids in fragments of length $N$

$$ d(I, J) = \sum_{l=1}^{N} \rho\bigl(A(i_l, j_l)\bigr). \qquad (1) $$

Here $I = i_1 \cdots i_N$, $J = j_1 \cdots j_N$ are sequences of amino acids ($i_l$, $j_l$ are the
$l$-th amino acids in the fragments $I$, $J$), $(A(i,j))$ is the matrix of some probabilistic model
of evolution of proteins (PAM or BLOSUM matrices, see for example [18]), i.e. the matrix element
$A(i,j)$ is large when the probability of transition between amino acids $i$ and $j$ in the
evolution model is large, and $\rho$ is a positive decreasing function.

This means that the distance between the fragments $I$, $J$ is equal to the sum of distances
between the amino acids in the corresponding positions; the distance between two amino acids is
large when the probability of substitution of one for the other in the model of evolution is small.
Thus $d(\cdot, \cdot)$ measures the evolutionary distance between the fragments.
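
The distance (1) can be illustrated by a minimal sketch. The substitution scores below are a hypothetical toy table for four residues (a real computation would load a PAM or BLOSUM matrix), and the choice $\rho(x) = e^{-x/2}$ is only one possible positive decreasing function; both are assumptions made for this illustration.

```python
import math

# Hypothetical toy substitution scores A(i, j) for four residues; a real
# application would load a PAM or BLOSUM matrix instead.
TOY_SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("L", "I"): 2, ("A", "L"): -1, ("A", "I"): -1,
    ("A", "D"): -2, ("L", "D"): -4, ("I", "D"): -3,
}

def score(a, b):
    """Symmetric lookup of the toy score A(a, b)."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]

def rho(x):
    """One possible positive decreasing function (an assumption)."""
    return math.exp(-x / 2.0)

def d(seq_i, seq_j):
    """Distance (1): sum over positions of rho(A(i_l, j_l))."""
    if len(seq_i) != len(seq_j):
        raise ValueError("fragments must have the same length N")
    return sum(rho(score(a, b)) for a, b in zip(seq_i, seq_j))
```

With these toy values, a fragment is closer to a copy with two conservative substitutions (L and I exchanged) than to an unrelated fragment. Note that for a strictly positive $\rho$ the value $d(I, I)$ is small but not zero; the sketch follows the formula as stated.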

Metrics on the set of conformations of fragments. A metric on the space $Y$ of
conformations of fragments can be introduced in different ways. One can parameterize the space
$Y$ using sequences of pairs of dihedral angles $(\varphi, \psi)$ (the Ramachandran diagrams) for
the corresponding amino acids. In particular, for fragments of length five the space $Y$ is the
8-dimensional torus (here we take into account the four inner vertices of the fragments and the
corresponding dihedral angles). For two conformations $\Gamma_1$, $\Gamma_2$ which correspond to
the sequences of dihedral angles $\{(\varphi_1^\alpha, \psi_1^\alpha)\}$,
$\{(\varphi_2^\alpha, \psi_2^\alpha)\}$, consider the metric of root mean square deviation

$$ s(\Gamma_1, \Gamma_2) = \sqrt{\sum_{\alpha=1}^{N-1} \Bigl( (\varphi_1^\alpha - \varphi_2^\alpha)^2 + (\psi_1^\alpha - \psi_2^\alpha)^2 \Bigr)}. $$

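Other ways to introduce a metric are as follows:

$$ \max_{\alpha} \sqrt{(\varphi_1^\alpha - \varphi_2^\alpha)^2 + (\psi_1^\alpha - \psi_2^\alpha)^2}, \qquad
\sum_{\alpha=1}^{N-1} \sqrt{(\varphi_1^\alpha - \varphi_2^\alpha)^2 + (\psi_1^\alpha - \psi_2^\alpha)^2}. $$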
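
The angular distances above can be sketched as follows. A conformation is taken to be a list of $(\varphi, \psi)$ pairs in radians for the inner vertices of a fragment. Angle differences are wrapped to $[-\pi, \pi)$ because $Y$ is a torus; this wrapping is our assumption and is not spelled out in the formulas, and the function names are hypothetical.

```python
import math

def wrap(delta):
    """Map an angle difference in radians to [-pi, pi).

    This is an assumption of the sketch: Y is a torus, so angles that
    differ by a multiple of 2*pi are identified.
    """
    return (delta + math.pi) % (2 * math.pi) - math.pi

def pair_dists(conf1, conf2):
    """Per-vertex Euclidean distance in the (phi, psi) plane."""
    if len(conf1) != len(conf2):
        raise ValueError("conformations must have the same number of vertices")
    return [math.hypot(wrap(p1 - p2), wrap(q1 - q2))
            for (p1, q1), (p2, q2) in zip(conf1, conf2)]

def s_l2(conf1, conf2):
    """Square root of the sum of squared per-vertex distances."""
    return math.sqrt(sum(x * x for x in pair_dists(conf1, conf2)))

def s_max(conf1, conf2):
    """Maximum of the per-vertex distances."""
    return max(pair_dists(conf1, conf2))

def s_sum(conf1, conf2):
    """Sum of the per-vertex distances."""
    return sum(pair_dists(conf1, conf2))
```

For any pair of conformations the three values are ordered as `s_max <= s_l2 <= s_sum`, so a covering by balls in one of these metrics gives bounds for coverings in the others.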



Metrics can also be defined in an alternative way, using representations of the conformations
$\Gamma_1$, $\Gamma_2$ by sets of coordinates of the $C_\alpha$-atoms of the fragments in
three-dimensional space. The metric can be introduced by the expression

$$ s'(\Gamma_1, \Gamma_2) = \sqrt{\sum_{\alpha=0}^{N} \Bigl( (x_1^\alpha - x_2^\alpha)^2 + (y_1^\alpha - y_2^\alpha)^2 + (z_1^\alpha - z_2^\alpha)^2 \Bigr)}, $$

where $(x_1^\alpha, y_1^\alpha, z_1^\alpha)$ are the coordinates of the $C_\alpha$-atom with
number $\alpha$ in the conformation $\Gamma_1$ (analogously $(x_2^\alpha, y_2^\alpha, z_2^\alpha)$
for $\Gamma_2$). Here we choose the embeddings of the conformations $\Gamma_1$, $\Gamma_2$ in
$\mathbb{R}^3$ in such a way that the metric $s'$ is the infimum over possible embeddings of the
conformations. The summation runs over $C_\alpha$-atoms (for a fragment of length $N$ one has
$N + 1$ such atoms).

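Other possible metrics have the forms

$$ \max_{\alpha = 0, \dots, N} \sqrt{(x_1^\alpha - x_2^\alpha)^2 + (y_1^\alpha - y_2^\alpha)^2 + (z_1^\alpha - z_2^\alpha)^2}, \qquad
\sum_{\alpha=0}^{N} \sqrt{(x_1^\alpha - x_2^\alpha)^2 + (y_1^\alpha - y_2^\alpha)^2 + (z_1^\alpha - z_2^\alpha)^2}. $$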

Coarse graining of the space of conformations of fragments. It is known from the
analysis of protein conformations in the model of fragments [5], [6], [7], [8], [9], [10], [11], [12]
that the known conformations of fragments of proteins are concentrated in some sufficiently small
subset of the space $Y$. For the description of this phenomenon one can use a coarse graining
procedure for observable conformations in $Y$, using a covering of the set $S \subset Y$ by some
set of balls (with respect to one of the metrics in the space $Y$ discussed above).

In particular, it was shown in [6] that all experimentally observed conformations in $S$ belong
to some covering of $S$ containing about one hundred balls of small diameter $\varepsilon$
(approximately 1 Å in the metric $s'$ of root mean square deviation). The total number of balls of
the same diameter in a covering of the whole space $Y$ is orders of magnitude larger.

Therefore the $\varepsilon$-entropy of the subset of the space $Y$ of conformations of fragments
that is observed in proteins is very small. Experimentally observable conformations of fragments
are very specific if we ignore differences between conformations at small distances.

Taking this observation into account, we denote by $Y_\varepsilon$ the set of coarse grained
observed conformations of fragments. The set $Y_\varepsilon$ can be understood as a covering of
the set of conformations in $S$ by balls of diameter $\varepsilon$ with respect to one of the
metrics in $Y$ described above, or equivalently this set can be considered as an
$\varepsilon$-net for $S \subset Y$ (if we put in correspondence to a ball the center of this ball).

Usually, when discussing $\varepsilon$-entropy, one considers the dependence of the entropy on
$\varepsilon$. In the case under consideration we are interested in the entropy for fixed
$\varepsilon$ (approximately 1 Å).
$\varepsilon$-Entropy. Let us recall the definition of $\varepsilon$-entropy, see for instance [19].

Let $A$ be a non-empty subset of a metric space $R$.

A system $\gamma$ of sets $U \subset R$ is called an $\varepsilon$-covering of $A$ if the diameter
of any $U \in \gamma$ is less than or equal to $2\varepsilon$ and $A$ is contained in the union of
the sets $U \in \gamma$. We denote by $N_\varepsilon(A)$ the minimal number of sets in an
$\varepsilon$-covering of $A$.

Then $H_\varepsilon(A) = \log N_\varepsilon(A)$ is the $\varepsilon$-entropy of the set $A$.
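
For a finite sample of conformations the minimal covering is hard to compute exactly, but a simple greedy covering gives an upper bound on $N_\varepsilon(A)$ and hence on $H_\varepsilon(A)$. The sketch below is one such heuristic, not the construction used in the cited works; the function names and the one-dimensional toy data are assumptions.

```python
import math

def greedy_cover(points, dist, eps):
    """Centers of balls of radius eps (diameter <= 2*eps) covering all points.

    Each point either lies within eps of an existing center or becomes a
    new center, so the balls around the centers form an eps-covering.
    """
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

def entropy_upper_bound(points, dist, eps):
    """log of the size of the greedy covering: an upper bound on H_eps."""
    return math.log(len(greedy_cover(points, dist, eps)))

# Toy sample on the real line with three well separated groups.
sample = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
centers = greedy_cover(sample, lambda a, b: abs(a - b), 0.5)
```

In this toy sample the six points are covered by three balls, so the bound on the entropy is $\log 3$. The same routine can be applied to fragment conformations with any of the metrics on $Y$ in place of the absolute difference.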

Protein conformations and sequences of fragments. For a protein with primary structure
$I$ (the sequence of amino acid residues) and conformation $\Gamma$ we consider the corresponding
set $(I_i, \Gamma_i)$ of fragments of length $N = 5$ in the protein $(I, \Gamma)$, where $I_i$ is the
sequence and $\Gamma_i \in Y_\varepsilon$ is the conformation of the $i$-th fragment in the protein
(i.e. the conformation is coarse grained in the sense discussed above).

Therefore a conformation $\Gamma$ of a protein can be represented by a sequence $(\Gamma_i)$ of
symbols from $Y_\varepsilon$ which describe coarse grained conformations of fragments.

The procedure described above cannot generate an arbitrary sequence $(\Gamma_i)$ of
conformations of fragments starting from some protein conformation. Conformations of neighboring
fragments must be compatible, i.e. should have a well defined intersection. For conformations
$\Gamma_i$, $\Gamma_{i+1}$ from $Y_\varepsilon$ of neighboring fragments, the distance between the
shorter fragments corresponding to the intersection of $\Gamma_i$, $\Gamma_{i+1}$ should be less
than or equal to $\varepsilon$.

Let us consider the matrix $(C_{\Gamma\Delta})$, $\Gamma, \Delta \in Y_\varepsilon$, of
intersections of fragments. The matrix element $C_{\Gamma\Delta} = 1$ if the pair
$(\Gamma, \Delta)$ of conformations of fragments is compatible, i.e. there exists a
conformation of a fragment of the polymer of length six in which the first five residues have the
conformation $\Gamma$ and the last five residues have the conformation $\Delta$. In the opposite
case, if the pair $(\Gamma, \Delta)$ is incompatible, we put $C_{\Gamma\Delta} = 0$. This matrix
is nonsymmetric since a peptide chain is directed.

The matrix $(C_{\Gamma\Delta})$ is sparse, i.e. the majority of its matrix elements are equal to
zero. This matrix (for fragments of length $N = 5$) has dimension about $100 \times 100$, and the
number of non-zero matrix elements is of the order of a thousand. The sparsity of this matrix puts
considerable constraints on the size of the set of possible conformations of proteins. For the
lattice version of the model of fragments the sparsity of this matrix implies lattice analogues of
secondary structures [16].
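
The effect of sparsity on the number of admissible conformations can be sketched with a small dynamic program that counts the sequences $(\Gamma_i)$ in which every neighboring pair is compatible. The three-state successor table below is a hypothetical toy stand-in for the real matrix of about $100 \times 100$ elements.

```python
def count_chains(successors, length):
    """Count sequences of `length` conformations with all neighbors compatible.

    `successors[g]` is the set of conformations h with C_{gh} = 1, i.e. a
    sparse row of the compatibility matrix. Every successor must itself be
    a key of the table.
    """
    counts = {g: 1 for g in successors}  # chains of length 1 ending in g
    for _ in range(length - 1):
        new_counts = {g: 0 for g in successors}
        for g, n in counts.items():
            for h in successors[g]:
                new_counts[h] += n
        counts = new_counts
    return sum(counts.values())

# Hypothetical toy table: 4 non-zero elements out of 9, nonsymmetric.
TOY_SUCCESSORS = {0: {0, 1}, 1: {2}, 2: {0}}
```

Without the compatibility constraint there would be $3^M$ sequences of length $M$ over three states; with the toy table the count grows much more slowly, which is the mechanism by which sparsity restricts the set of possible protein conformations.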


3 Statistical potentials for the model of fragments 

In order to construct the joint distribution function of sequences of amino acids in fragments of
proteins and conformations of the fragments starting from the database $S$, we have to perform
averaging over sequences, because even after the coarse graining of conformations the database
will be too small. We will use averaging over balls with respect to the metric (1) in the space $X$
of sequences (i.e. use methods of nonparametric statistics).

We construct the joint distribution function $P_{S,\delta}(I, \Gamma)$, $I \in X$,
$\Gamma \in Y_\varepsilon$, of pairs (sequence, conformation) from the database $S$ using
averaging of the data $(I', \Gamma') \in S$ over the ball of diameter $\delta$ with center $I$
with respect to a metric of the form (1) in the space $X$ of sequences. Here the conformation
$\Gamma'$ should belong to the corresponding element $\Gamma \in Y_\varepsilon$ of the covering of
the set of conformations of fragments and $d(I, I') < \delta/2$. Therefore the function
$P_{S,\delta}(I, \Gamma)$ is equal to the number of fragments from $S$ which belong to the direct
product of the ball in $X$ and the element $\Gamma \in Y_\varepsilon$ of the covering (i.e.
$P_{S,\delta}(I, \Gamma)$ is the coarse grained joint occurrence frequency for the conformation
and the sequence of the fragment).

The diameter $\delta$ used in the computation of $P_{S,\delta}(I, \Gamma)$ is chosen as follows:
on average, for a fragment of length five in a ball with center $I$, three (or more) amino acids
coincide with the corresponding amino acids in $I$ and two can differ.

The sample $S$ of fragments is random, thus the function $P_{S,\delta}(I, \Gamma)$ is also random.
For sufficiently large samples $S$ and diameters $\delta$ this function should converge to a
deterministic function.
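
The counting that defines $P_{S,\delta}(I, \Gamma)$ can be sketched as follows. To keep the example self-contained, the Hamming distance is used as a stand-in for the metric (1), and the database is a hypothetical toy list of (sequence, conformation label) pairs; with the Hamming distance, $\delta = 6$ corresponds to the rule that at most two of five residues may differ.

```python
def hamming(a, b):
    """Stand-in for the metric (1): number of differing positions (an assumption)."""
    return sum(x != y for x, y in zip(a, b))

def occurrence(database, sequence, conformation, delta, dist=hamming):
    """Coarse grained joint occurrence count P_{S,delta}(I, Gamma).

    Counts fragments (I', Gamma') in the database with Gamma' equal to the
    given covering element and dist(I, I') < delta / 2.
    """
    return sum(1 for seq, conf in database
               if conf == conformation and dist(sequence, seq) < delta / 2)

# Hypothetical toy database S of (sequence, coarse grained conformation label).
S = [("ALDKV", "h"), ("ALDKI", "h"), ("ALDRV", "h"),
     ("ALDKV", "s"), ("GGPGG", "c")]
```

In this toy database the sequence ALDKV is seen in conformation h three times after averaging over the ball, and only once in conformation s, although the exact sequence occurs once with each label. This is the sense in which averaging over sequences compensates for the small size of the database.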

Statistical potentials. For modeling of proteins one can use statistical (or empirical) potentials. 
The notion of a statistical potential is based on the idea that properties of proteins are distributed 
with respect to the Boltzmann distribution, see for example 0 . 0 . Let p be some property of 
the protein (for instance some set of coordinates of relative positions of some residues). One can
introduce the corresponding statistical potential by the formula

$$ E(p) = -\log n(p), $$

where $n(p)$ is the observed frequency of occurrence of the property $p$ in the database. The
property $p$ may describe distances between $C_\alpha$-atoms of the backbone, values of dihedral
angles, etc.

In particular, the Miyazawa-Jernigan matrix of energies of pairwise interactions of amino acids,
built using the statistics of contacts of amino acids in proteins, is an example of a statistical
potential [20].

In the above examples the statistical potentials were used for the description of real interactions
in proteins. We consider a different point of view: we describe proteins by a set of convenient
parameters (using the model of fragments), then define the energy of the model using statistical
potentials for the parameters under consideration.

Statistical potentials for the model of fragments. Let us consider a model of the free energy
of a protein in which the energy is equal to a sum of contributions of fragments. For a fragment
$(I, \Gamma)$, $I \in X$, $\Gamma \in Y_\varepsilon$, we define the energy of the pair $I$,
$\Gamma$ through the coarse grained joint occurrence frequency as the statistical potential

$$ \Phi(I, \Gamma) = -\log P_{S,\delta}(I, \Gamma). \qquad (2) $$

For a protein with the conformation $\Gamma$ and the primary structure $I$ we introduce the free
energy functional

$$ F_0(I, \Gamma) = \sum_i \Phi(I_i, \Gamma_i), \qquad (3) $$

where the summation runs over the fragments of length five in the protein, $I_i$ is the sequence
and $\Gamma_i$ is the conformation of the $i$-th fragment of the protein $(I, \Gamma)$. The
physical meaning of this functional is described by the assumption that conformations of
fragments with low energy are frequent in proteins (i.e. proteins are selected to make the
energies of native conformations low or, equivalently, to make native conformations stable).

In particular, $\Phi(I_i, \Gamma_i)$ should describe local properties (for example elasticity) of a
protein at the $i$-th fragment. This idea can be used for the comparison of different proteins and
for the investigation of protein dynamics, see below.
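
A minimal sketch of the potential (2) and the functional (3): the protein sequence is cut into overlapping windows of length five, and each window is paired with one coarse grained conformation label. The table of occurrence counts is a hypothetical toy input (in practice it would be computed from the database $S$), and assigning infinite energy to pairs that were never observed is our assumption.

```python
import math

def phi(count):
    """Statistical potential (2): minus log of the occurrence count."""
    return math.inf if count == 0 else -math.log(count)

def free_energy_f0(sequence, conformations, counts, n=5):
    """Functional (3): sum of phi over all windows of length n.

    `conformations[i]` is the label of the i-th window; `counts` maps
    (window sequence, label) to an occurrence count.
    """
    windows = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    if len(windows) != len(conformations):
        raise ValueError("need one conformation label per window")
    return sum(phi(counts.get((w, g), 0))
               for w, g in zip(windows, conformations))

# Hypothetical toy table of occurrence counts.
TOY_COUNTS = {("ALDKV", "h"): 8, ("LDKVA", "h"): 2, ("LDKVA", "s"): 1}
```

For the six-residue toy sequence ALDKVA there are two windows; the assignment (h, h) has energy $-\log 8 - \log 2 = -\log 16$, lower than the assignment (h, s) with energy $-\log 8$, so the more frequent pair is preferred.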
Remark. Substitution of a small number of amino acid residues in a fragment $I$ by similar
residues will not dramatically change the probabilities for this fragment to take conformations
$\Gamma \in Y_\varepsilon$. Therefore the function $\Phi(I, \Gamma)$ should depend on $I$
regularly, for example it should satisfy the Lipschitz condition with respect to the metric $d$ in
$X$:

$$ |\Phi(I, \Gamma) - \Phi(J, \Gamma)| \le c\, d(I, J). \qquad (4) $$

A similar discussion of the frequency of occurrence of fragments $I$ in databases (without taking
into account the conformations of fragments) can be found in [IT].

From the standard point of view, conformations of polymers (in particular, fragments of
proteins) belong to a continuous space, and sequences of residues in polymers belong to a discrete
set. Coarse graining of conformations of fragments and the Lipschitz continuity property (4) show
that for fragments of proteins it is natural to consider the opposite assumption: the
conformations of fragments, up to small deformations, belong to a small discrete set
$Y_\varepsilon$, and the sequences (from the point of view of probabilities to take conformations
in $Y_\varepsilon$) are quasi-continuous.

4 Contacts of fragments 

It is natural to discuss the contribution to the free energy of a protein from contacts of
fragments. Let us consider amino acid residues with numbers $i$ and $j$ in the protein which are
in contact (i.e. the distance between the residues is sufficiently small) and are not neighbors:
$|i - j| > 1$. The corresponding fragments with centers at $i$, $j$ are $\Gamma_i$, $\Gamma_j$.

One has to distinguish different kinds of contacts. Contacts in secondary structures are
related to hydrogen bonds between the residues in the backbone of the peptide chain. These contacts
are not very specific (the dependence on the types of amino acids is low) and their energy can, in
some approximation, be taken proportional to the number of hydrogen bonds (i.e. the number of
contacts).

Contacts between different secondary structures and contacts in loops are made by the side
chains of amino acids and strongly depend on the geometry and other properties of the corresponding
fragments. Let us assume that the energy of a contact is determined by the contacting amino acid
residues and their neighbors, i.e. by the contact of fragments of length three. If we tried to
take into account longer contacting fragments, it would be more complicated to find the statistics
for contacts.

Thus we consider a contact of fragments (of length five) $(I, \Gamma_1)$ and $(J, \Gamma_2)$ which
are in contact at the central (third) residues of the fragments; here
$\Gamma_1, \Gamma_2 \in Y_\varepsilon$. For these fragments we consider the subfragments
$(\tilde I, \tilde\Gamma_1)$, $(\tilde J, \tilde\Gamma_2)$ of length three (with the same centers),
where $\tilde I$, $\tilde J$ are given by restriction of $I$, $J$ respectively. The conformation
$\tilde\Gamma$ can be considered as corresponding to a union of the sets
$\Gamma \in Y_\varepsilon$ such that the restrictions of conformations of fragments in these sets
to fragments of length three (with common central residue) are at a distance less than or equal to
$\varepsilon$ from each other. The corresponding covering of the set of conformations of fragments
of length three we denote by $\tilde Y_\varepsilon$.

Let us consider for a specific contact (i.e. a contact between different secondary structures
or loops) the statistical potential

$$ \Psi(\tilde I, \tilde J, \tilde\Gamma_1, \tilde\Gamma_2) = -\log \bigl(\text{frequency of occurrence of } (\tilde I, \tilde J, \tilde\Gamma_1, \tilde\Gamma_2)\bigr), \qquad (5) $$

where we consider the frequency of occurrence of contacts of the fragments
$(\tilde I, \tilde\Gamma_1)$, $(\tilde J, \tilde\Gamma_2)$; the centers of the fragments are in
contact if the distance between the centers is less than some $\varepsilon'$.

Let us add to the functional (3) of the free energy of a protein a term which describes the
contacts of fragments:

$$ F_1(I, \Gamma) = \lambda_1 \sum_{ij \in C_1} 1 + \lambda_2 \sum_{ij \in C_2} \Psi(\tilde I_i, \tilde I_j, \tilde\Gamma_i, \tilde\Gamma_j). \qquad (6) $$

Here $I$ is the sequence of amino acids in the protein, $\Gamma$ is the conformation of the
protein; $I_i$, $\Gamma_i$ correspond to fragments of length five; $\tilde I_i$, $\tilde\Gamma_i$
correspond to fragments of length three; $C_1$ is the set of contacts in secondary structures
(non-specific contacts), $C_2$ is the set of specific contacts; in the sums the $i$-th and $j$-th
residues are in contact; $\lambda_1, \lambda_2 > 0$ are some constants.

One can also take into account the (lower) specificity of contacts in secondary structures (in
particular beta sheets) by substituting the summands in the first sum (over non-specific contacts)
in (6) by the analogue of (5), computed using some database. For example, one can take into
account the hydrophobic interaction of side chains of amino acids.


5 Hierarchy of structures in proteins 

In the present section we consider a generalization of the model of fragments (3), (6) which takes
into account a hierarchy of fragments of different lengths. In addition to the database $S$ of
fragments we will use a database $T$ of proteins (this database should contain primary structures
of proteins and their native conformations). We will understand conformations of proteins as
sequences $(\Gamma_i) = \Gamma_1 \dots \Gamma_M$ of coarse grained conformations of fragments.
Thus a protein $(I, \Gamma)$ from $T$ generates a sequence $(I_i, \Gamma_i)$ of fragments.

Let us consider for a protein $(I, \Gamma) \in T$ the values $\Phi(I_i, \Gamma_i)$ of statistical
potentials of fragments as a function of the number $i$ of the fragment. This function can be
considered as a stepwise real valued function $\Phi(x)$ on the interval $[1/2, M + 1/2]$, where
the function on the interval $[i - 1/2, i + 1/2]$ equals $\Phi(I_i, \Gamma_i)$.

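Let us apply to this function the smoothing procedure by convolution with the Gaussian function
$\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{x^2}{2\sigma^2}}$ with the variance $\sigma^2$ ($2\sigma$ is
close to one):

$$ f(x) = \int \Phi(y)\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-y)^2}{2\sigma^2}}\, dy. $$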
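
Because the function being smoothed is stepwise, the convolution has a closed form in terms of the error function, which the following sketch uses. It assumes that the step function is extended by zero outside $[1/2, M + 1/2]$; the treatment of the boundary and the function name are assumptions of this illustration.

```python
import math

def smooth(values, x, sigma=0.5):
    """Gaussian smoothing of a step function at the point x.

    `values[i - 1]` is the value of the step function on [i - 1/2, i + 1/2]
    for i = 1..M; the function is taken to be zero elsewhere (an assumption).
    The integral of the Gaussian kernel over each step is a difference of
    two error functions.
    """
    s = sigma * math.sqrt(2.0)
    total = 0.0
    for i, v in enumerate(values, start=1):
        weight = 0.5 * (math.erf((i + 0.5 - x) / s) - math.erf((i - 0.5 - x) / s))
        total += v * weight
    return total
```

With $\sigma = 0.5$ (so that $2\sigma = 1$) a single step of height one centered at $x = 10$ is smoothed to a bump with peak value about $0.68$, while a constant function is left unchanged away from the ends of the interval.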

Let us construct for any protein $(I, \Gamma) \in T$ a tree $T(I, \Gamma)$ of a "hierarchy of
basins" [21] as follows. Let us fix some set $\{q_j\}$ of real numbers (the function $f(x)$ should
take values which lie between some $q_i$ and $q_k$). This set can be taken ordered with respect to
increasing indices, i.e. $q_i < q_j$ for $i < j$.

Let us consider in the interval $[1/2, M + 1/2]$ the set $\{x : f(x) < q_j\}$. This set is a union
of intervals. These intervals are partially ordered by inclusion and the partial order is described
by a tree, i.e. any two intervals are either non-intersecting (modulo a set of measure zero), or
one of the intervals contains the other (in this case the intervals correspond to different
$q_j$). The tree obtained in this way we denote by $T(I, \Gamma)$ (this tree depends on the
hierarchy $\{q_j\}$ of barriers). The vertices of the tree correspond to the intervals ("basins"),
the partial order in the tree is defined by inclusion of intervals, and edges connect intervals
embedded without intermediaries.
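
The construction of the tree of basins can be sketched on a discretized function. Here $f$ is sampled at a list of points, a basin at level $q$ is a maximal run of consecutive samples with $f < q$, and each basin is linked to the smallest strictly larger basin that contains it. The discretization and the representation of the tree as a child-to-parent map are assumptions of this sketch.

```python
def basins(f, q):
    """Maximal runs of consecutive indices k with f[k] < q, as (start, end)."""
    runs, start = [], None
    for k, v in enumerate(f):
        if v < q and start is None:
            start = k
        elif v >= q and start is not None:
            runs.append((start, k - 1))
            start = None
    if start is not None:
        runs.append((start, len(f) - 1))
    return runs

def basin_tree(f, thresholds):
    """Map each basin to its parent basin (None for a root).

    Sublevel sets grow with the threshold, so every basin found at one level
    is contained in exactly one basin of the next level; it gets that basin
    as parent unless the two coincide.
    """
    parent = {}
    previous = []
    for q in sorted(thresholds):
        current = basins(f, q)
        for node in current:
            parent.setdefault(node, None)
        for child in previous:
            for node in current:
                if node != child and node[0] <= child[0] and child[1] <= node[1]:
                    parent[child] = node
        previous = current
    return parent

toy_tree = basin_tree([3, 1, 3, 2, 5, 1, 4], [2, 4, 6])
```

For the toy samples, the two narrow basins found at the lowest threshold are attached to larger intervals at higher thresholds, and the whole range is the root of the tree.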






In the limiting case when the set $\{q_j\}$ is dense we get the hierarchical partition of the
interval $[1/2, M + 1/2]$ by local maxima of the function $f(x)$; in the general case we get a
coarse grained hierarchical partition of $[1/2, M + 1/2]$.

A similar procedure of hierarchical partition of proteins was discussed in papers by Nekrasov
et al. [13], [14], [15] (where the frequencies of occurrence of the fragments $I_i$ were considered
and the conformations $\Gamma_i$ of the fragments were not taken into account). It was shown that
maximal branches in the obtained trees correspond to domains in proteins.

Each interval from the described system of "basins" corresponds to some branch of the tree
$T(I, \Gamma)$. This branch contains the vertex corresponding to the interval and all vertices
which are smaller with respect to the partial order (corresponding to subintervals of the given
interval). We will denote the interval and the corresponding branch of the tree by $A$. For a
given branch $A$ let us consider the set of integer points $i$ belonging to the interval $A$, and
the corresponding set of fragments with conformations $\Gamma_i$. We will denote by
$(\Gamma_i)_A$ the obtained sequence of conformations of fragments.

Then, for all sequences of fragments $(\Gamma_i)_A$ obtained in the described way, we compute the
frequencies of occurrence in the database $T$ and the corresponding statistical potentials

$$ \Phi(A) = -\log \bigl(\text{frequency of occurrence of } (\Gamma_i)_A \text{ in } T\bigr). $$
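
A minimal sketch of this computation: the database $T$ is represented by a list of conformation sequences (strings of hypothetical one-letter labels), and the frequency of a branch is taken to be the raw number of its occurrences as a contiguous subsequence. Using an unnormalized count, and infinite energy for unseen branches, are assumptions of the sketch.

```python
import math

def occurrences(branch, proteins):
    """Number of positions at which `branch` occurs contiguously in the database."""
    n = len(branch)
    target = tuple(branch)
    return sum(1
               for conf in proteins
               for k in range(len(conf) - n + 1)
               if tuple(conf[k:k + n]) == target)

def branch_potential(branch, proteins):
    """Statistical potential of a branch: minus log of its occurrence count."""
    count = occurrences(branch, proteins)
    return math.inf if count == 0 else -math.log(count)

# Hypothetical toy database of coarse grained protein conformations.
TOY_PROTEINS = ["hhhsc", "chhh", "ssss"]
```

In the toy database the branch hh occurs four times and the longer branch hhh only twice, which illustrates that longer sequences of conformations are rarer.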

The idea of considering the statistical potentials $\Phi(A)$ of branches of the trees
$T(I, \Gamma)$ is based on the following observation: any branch of the trees under consideration
(i.e. a sequence $(\Gamma_i)_A$) corresponds to some conformation which may occur in many
proteins. In particular, in [13], [14], [15] it was shown that maximal branches correspond to
domains in proteins. Thus the conformational entropy of the hierarchical set of elements in
proteins under discussion should be comparatively low (in comparison with arbitrary possible
sequences of conformations of fragments). Therefore we can use the database of conformations
related to branches $A$ for the investigation of conformational structures of proteins.

We will get a database of branches $A$ and the corresponding conformations of fragments of
proteins $(\Gamma_i)_A$. This database does not contain information about the sequences of amino
acids in the fragments, since the entropy of sequences is large.

Let us consider the contribution to the free energy of a protein $(I, \Gamma)$ of the form

$$ F_2(I, \Gamma) = \sum_{A:\, (\Gamma_i)_A \subset \Gamma} \lambda(A)\, \Phi(A). \qquad (7) $$

Here the summation runs over branches $A$ for which the corresponding conformations are
subsequences of neighboring fragments in $\Gamma = \Gamma_1 \dots \Gamma_M$, i.e.
$(\Gamma_i)_A \subset \Gamma_1 \dots \Gamma_M$. There is no explicit dependence on the primary
structure $I$ of the protein in the above expression (the primary structure was used for the
construction of the branches $A$ of the tree $T(I, \Gamma)$).

Expression (7) for the energy of a conformation contains the information about the selection of
conformations of protein segments at the level of branches of the tree $T(I, \Gamma)$. The positive
constants $\lambda(A)$ should grow with increasing length of the sequence $(\Gamma_i)_A$ (to
compensate for small values of the statistical potential $\Phi(A)$). It is natural to consider
$\lambda(A)$ which depend only on the length of this sequence. Thus the expression for $F_2$ means
that conformations corresponding to hierarchically embedded spatial structures that are frequent
in proteins are energetically favorable.

The total free energy of a protein in the model under consideration is the sum of the
contributions (3), (6) and (7):

$$ F(I, \Gamma) = F_0(I, \Gamma) + F_1(I, \Gamma) + F_2(I, \Gamma). \qquad (8) $$

Remark. The hierarchical markup of proteins described above is analogous to the hierarchical
syntax markup of texts. A text is a sequence of letters, which can be considered as containing
several levels of hierarchy: letters, words (combinations of letters), collocations (combinations
of words), phrases. Each level of the hierarchy has lower entropy than the set of arbitrary
combinations of elements of the previous level (for example, the number of words is much less than
the number of possible combinations of letters).

Remark. Protein folding is a cooperative transition (similar to a first order phase transition).
But in one-dimensional systems there should be no phase transitions. One can consider the term
(7) in the model of the free energy of a protein as a description of this cooperative transition.
This contribution is non-local, which may be compared to the existence of long range interactions
in the Sherrington-Kirkpatrick model of spin glasses, which predicts the glass transition
(described by the hierarchical replica symmetry breaking method).

6 Methods of analysis of protein conformations 

In the present section we discuss the application of the model (8) of statistical
potentials of fragments introduced above to the investigation of protein conformations.

Let us discuss the problem of reproducing the native conformation of a protein starting from
its primary structure. We discuss two approaches. The first is based on minimization of the free
energy functional (8) over conformations $\Gamma$ (the problem of folding). The second is a variant of the
threading method where the scoring function for protein comparison is constructed with the help 
of functionals (ED and (ED- 

Problem of folding (combinatorial optimization). The aim is, starting from the primary 
structure I (sequence of amino acid residues in a protein), to reproduce the native conformation 
T of a protein as the global minimum of the free energy functional ([HD- We have to construct the 
sequence of conformations of fragments Tj which minimizes ([HD- 

In this statement the problem of folding is a problem of combinatorial optimization. The 
condition of compatibility of neighbor fragments (namely, sparsity of the matrix (Cfa) of inter¬ 
sections of fragments) considerably reduces the volume of the set for brute-force search for global 
minimum. 

Complexity of this problem of combinatorial optimization can be reduced in the following way.
Let us find fragments where the conformation is defined unambiguously by the sequence (i.e.
$\Phi(I_i, \Gamma_i)$ for such $I_i$ is concentrated on a unique $\Gamma_i$). We will call fragments of this kind
certain. Then for construction of the native conformation of a protein it is sufficient to minimize
the functional of free energy on subsequences lying between two certain fragments. If the distance
between two certain fragments is not very large then the problem of combinatorial optimization
will not be very complex.
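
The reduction can be illustrated by a minimal sketch. It assumes that the energy is a sum of per-fragment terms only and that compatibility is checked between neighboring fragments only, so that the gaps between certain fragments can be searched independently. The names `fold_between_certain`, `candidates`, `energy` and `compatible` are hypothetical and do not come from the model itself.

```python
from itertools import product
from math import inf


def fold_between_certain(candidates, energy, compatible):
    """Pick one conformation label per fragment position.

    candidates[i]    : admissible labels for fragment i
                       (a single label means the fragment is "certain").
    energy(i, g)     : hypothetical statistical potential of fragment i in label g.
    compatible(a, b) : True when neighboring labels a and b can be glued.
    """
    n = len(candidates)
    result = [None] * n
    certain = [i for i, labels in enumerate(candidates) if len(labels) == 1]
    for i in certain:
        result[i] = candidates[i][0]

    # Certain fragments (and the chain ends) cut the chain into independent gaps.
    bounds = [-1] + certain + [n]
    for left, right in zip(bounds, bounds[1:]):
        free = list(range(left + 1, right))
        if not free:
            continue
        head = [result[left]] if left >= 0 else []
        tail = [result[right]] if right < n else []
        best, best_energy = None, inf
        # Brute force only inside the gap; the cost is exponential in len(free).
        for choice in product(*(candidates[i] for i in free)):
            chain = head + list(choice) + tail
            if not all(compatible(a, b) for a, b in zip(chain, chain[1:])):
                continue
            total = sum(energy(i, g) for i, g in zip(free, choice))
            if total < best_energy:
                best, best_energy = choice, total
        if best is None:
            raise ValueError(f"no compatible assignment between {left} and {right}")
        for i, g in zip(free, best):
            result[i] = g
    return result
```

The cost of the search is governed by the longest gap between certain fragments rather than by the length of the whole chain, which is the point of the argument above.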

It was shown [IT] that short fragments with high occurrence frequency occur sufficiently
often along the protein sequence (long segments with low occurrence frequency of short
fragments are very rare). From the point of view of the functional (8) this reduces the
complexity of the folding problem.

The problem of minimization of energy (8) is the analogue of the maximum likelihood method in data
analysis. Here the likelihood functional (the product of occurrence frequencies of conformations of
fragments multiplied by the contributions of contacts of fragments and of the hierarchy of longer
fragments) is equal to the exponential of the energy taken with the opposite sign.
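
The correspondence can be written out in the simplest case. The following is a sketch under two assumptions that are not restated in this section: only the fragment term of the energy is kept, and the statistical potential of a fragment is the negative logarithm of its occurrence frequency, denoted here by $f$ (a symbol introduced only for this illustration).

```latex
\begin{align*}
L(\Gamma) &= \prod_i f(I_i, \Gamma_i), \qquad \Phi(I_i, \Gamma_i) = -\ln f(I_i, \Gamma_i), \\
E(\Gamma) &= \sum_i \Phi(I_i, \Gamma_i) = -\ln L(\Gamma), \qquad L(\Gamma) = e^{-E(\Gamma)}, \\
\arg\max_{\Gamma} L(\Gamma) &= \arg\min_{\Gamma} E(\Gamma).
\end{align*}
```

Under the same assumptions, the contributions of contacts and of the hierarchy of longer fragments would enter as additional factors of $L$, that is, as additional terms of $E$.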

Hierarchical approach to folding. Let us discuss the problem of folding from the point of
view of the hierarchy of branches $A$ in (7), (8). The search for minima of energy over conformations
can be organized in different ways. Let us consider the following hierarchical algorithm: at the first
step we search for fragments $(I_i, \Gamma_i)$ with minimal energy (i.e. unambiguously foldable), then we try
to include these fragments in different branches $(\Gamma_i)_A$, growing these branches by inclusion. The
search will (in some approximation) be greedy: at every step we have to minimize the energy
for one branch (hierarchically increasing this branch), then we combine the obtained branches at
the higher level of the hierarchy of branches. Thus we use the database of conformations related to
branches $A$ to reduce the volume of the search.

It was discussed in the literature [2] that folding in real proteins is similar to the greedy
procedure described above (with formation of nuclei of the native structure which grow in volume).
Greediness of the search can considerably reduce its volume. This volume (for the
described greedy search) can be estimated by the number of levels of hierarchy multiplied by
the complexity of the search for one unit of hierarchy. In principle this can reduce the search
from brute-force enumeration of all conformations to a search over a set with volume
proportional to the logarithm of the number of conformations. A similar point of view (hierarchy plus
greedy search) is the basis of the "deep learning" approach in machine learning, i.e. effective learning
of multilayer hierarchical neural networks [22], [23].

Using this discussion, we conjecture that the Levinthal paradox (the problem of search for the
global minimum of energy over an exponentially large space of conformations) might be solved with
the help of greedy search in the described hierarchical model of energy (8) based on the data of
bioinformatics.

Threading. In some approximation the majority of protein folds are known. Therefore the native
conformation of a protein can be found by comparing the protein with proteins from some
database with known native conformations. The procedure of threading for recognition of a
conformation has the following form: the sequence of a protein is aligned with sequences of proteins with
known conformations. Then the protein with the best alignment can be used for modeling of the
conformation of the protein under investigation.

Let us discuss a generalization of the threading method based on the idea of taking into account
the statistics of short fragments of proteins. We will compare a protein with the sequence $I$ (and
unknown native conformation) with a protein $J$ from the database with conformation $\Gamma$ using
an alignment of $I$ and $J$.

Let us put the protein $I$ in the conformation $\Gamma$ (as $J$) and consider the function $\Phi(I_i, \Gamma_i)$ as a
function of the fragment $I_i$, the $i$-th fragment in $I$. Let us consider the similarity functional

$$\pi(I, J) = \sum_i \left| \Phi(I_i, \Gamma_i) - \Phi(J_i, \Gamma_i) \right|, \qquad (9)$$

where $J_i$ is the $i$-th fragment in $J$. This functional depends on the alignment of the two proteins
and the corresponding alignment of fragments. One can consider a generalization of this functional
by adding to (9) some terms which compare the contacts of fragments of the form (5).

Two proteins $I$ and $J$ are similar if local physical properties of these proteins in the conformation
$\Gamma$ (the native conformation of $J$) are similar with respect to the functional of free energy (3). In
this case the value of the functional (9) will be low and one can assume that these proteins have the
same native conformation.
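
As an illustration, the similarity functional (9) can be evaluated as in the following sketch. It assumes a gapless, position-by-position alignment of the two sequences, fragments of length five, and a statistical potential available as a callable `phi(fragment, conformation)`; the names `fragments`, `similarity` and `phi` are hypothetical.

```python
def fragments(seq, k=5):
    """All overlapping fragments of length k, in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def similarity(seq_i, seq_j, conformations, phi, k=5):
    """Sum over fragments of |phi(I_i, G_i) - phi(J_i, G_i)|.

    conformations[i] is the coarse grained conformation of the i-th fragment
    of the database protein J; the query I is threaded onto the same labels.
    """
    frags_i = fragments(seq_i, k)
    frags_j = fragments(seq_j, k)
    if not (len(frags_i) == len(frags_j) == len(conformations)):
        raise ValueError("sequences and conformation labels must be aligned")
    return sum(
        abs(phi(a, g) - phi(b, g))
        for a, b, g in zip(frags_i, frags_j, conformations)
    )
```

A low value means that the query, when placed in the conformation of the database protein, has fragment potentials close to those of the database protein itself.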

One of the generalizations of the threading method is to consider threading not of a single protein but
of a family of homologs, and to average over this family. This approach has some similarity
with the one described above, where we take into account the statistics of fragments averaged over some
neighborhood of fragments in the primary structure $I$.

7 Structural alignment and model of fragments 

Application of statistical potentials for fragments allows us to modify not only the threading procedure
but also the construction of alignments. In the present section we introduce a version of structural
alignment construction related to the model of fragments. Structural alignment is a modification
of alignment of proteins which takes into account the structures of proteins (information about the
conformations).

Let us recall the definition of alignment, see for example [IB]. The edit distance between two
sequences of symbols from a finite alphabet is the minimal number of edit operations which map one
sequence to the other. The set of edit operations usually contains insertions, deletions and substitutions
of symbols. One can consider global alignment of sequences and local alignment (of segments of
sequences).
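
For reference, the edit distance with unit cost for insertions, deletions and substitutions can be computed by the standard dynamic programming recursion. The sketch below is the textbook algorithm, not something specific to the present model; the function name `edit_distance` is chosen for illustration.

```python
def edit_distance(v, w):
    """Minimal number of insertions, deletions and substitutions mapping v to w."""
    # prev[j] holds the distance between the processed prefix of v and w[:j].
    prev = list(range(len(w) + 1))
    for i, a in enumerate(v, start=1):
        cur = [i]  # distance between v[:i] and the empty prefix of w
        for j, b in enumerate(w, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a from v
                cur[j - 1] + 1,          # insert b into v
                prev[j - 1] + (a != b),  # match (cost 0) or substitution (cost 1)
            ))
        prev = cur
    return prev[-1]
```

The table has one row per symbol of the first sequence and one column per symbol of the second, so the running time is proportional to the product of the two lengths.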

Definition of alignment. Let $\mathcal{A}$ be a $k$-letter alphabet, $V$ and $W$ be finite sequences of symbols
from $\mathcal{A}$. Let $\mathcal{A}' = \mathcal{A} \cup \{-\}$ be the extended alphabet, where the symbol $-$ is called the gap symbol.

An alignment of two sequences $V = v_1 \ldots v_n$ and $W = w_1 \ldots w_m$ is a matrix with two lines of
equal length $l \ge n, m$. The first line of the matrix contains a sequence $\overline{V} = \overline{v}_1 \ldots \overline{v}_l$ obtained from
$V$ by insertion of $l - n$ gaps in some order, the second line contains $\overline{W} = \overline{w}_1 \ldots \overline{w}_l$ obtained from
$W$ by insertion of $l - m$ gaps. Columns with two gaps are forbidden.

Columns of the alignment matrix which contain gaps in the first line are called insertions,
columns containing gaps in the second line are deletions. Columns containing identical symbols
in both lines are called matches, and columns containing different symbols are called mismatches.

To each column of the alignment matrix one assigns a score, a real number
depending on the symbols in the column. The score of the alignment is the sum over the columns of the
matrix:

$$\delta(\overline{V}, \overline{W}) = \sum_{i=1}^{l} \delta(\overline{v}_i, \overline{w}_i). \qquad (10)$$

The alignment problem is to find an alignment with maximal score.

An alignment $(\overline{V}, \overline{W})$ of sequences $V$, $W$ corresponds to a sequence of edit operations on the
line $V$ which map $V$ to $W$. Edit operations correspond to columns of the alignment matrix
$(\overline{V}, \overline{W})$ and can be performed in arbitrary order. The operations are as follows: an insertion
corresponds to insertion of a gap in the line $V$ (at the position corresponding to the column), a deletion
corresponds to deletion of a symbol in $V$, and a mismatch corresponds to substitution of a symbol
from $V$ by the corresponding symbol from $W$.

In the simplest case one can put scores equal to $\delta(x, x) = 1$ for matches, $\delta(x, y) = -\mu$ for
mismatches and $\delta(x, -) = \delta(-, x) = -\sigma$ for insertions or deletions:

score(alignment) = #(matches) - $\mu$ #(mismatches) - $\sigma$ #(indels).
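
A short sketch of this simplest scoring scheme follows. The two arguments are the two lines of the alignment matrix written as strings with `-` as the gap symbol; the function name `alignment_score` and the default values of the penalties are assumptions made for the example.

```python
def alignment_score(line_v, line_w, mu=1.0, sigma=2.0, gap="-"):
    """Score = #(matches) - mu * #(mismatches) - sigma * #(indels)."""
    if len(line_v) != len(line_w):
        raise ValueError("both lines of the alignment must have equal length")
    score = 0.0
    for x, y in zip(line_v, line_w):
        if x == gap and y == gap:
            raise ValueError("columns with two gaps are forbidden")
        if x == gap or y == gap:
            score -= sigma   # insertion or deletion
        elif x == y:
            score += 1.0     # match
        else:
            score -= mu      # mismatch
    return score
```

For the alignment of `AC-GT` with `ACTGA` there are three matches, one mismatch and one indel, so with these default penalties the score is 3 - 1 - 2 = 0.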

For alignment of proteins one can use PAM or BLOSUM matrices for scores of amino acid
substitutions, i.e. the score for substitution of amino acids $i$ and $j$ will be a function of the matrix
element $A_{ij}$.

Structural alignment. Let us consider an alignment $(\overline{V}, \overline{W})$ of sequences $V = v_1 \ldots v_n$ and
$W = w_1 \ldots w_m$ (sequences of amino acids in some proteins). Let the conformation $\Gamma$ of the protein
$W$ be known, and let us consider the corresponding sequence of conformations of fragments (of
length five) $\Gamma = \Gamma_3 \Gamma_4 \ldots \Gamma_{m-2}$ (here we enumerate fragments by numbers of amino acids in central
positions of the fragments).

In the aligned sequences $\overline{V}$, $\overline{W}$ one can consider fragments $V_i$, $W_i$ of length five with centers
in the $i$-th positions. We consider only the case when both fragments $V_i$, $W_i$ do not contain gaps.
Then, let $\Gamma_i$ be the conformation of the fragment $W_i$ (obtained by restriction of the conformation
of the protein $(W, \Gamma)$).

Let us consider the statistical potentials $\Phi(V_i, \Gamma_i)$, $\Phi(W_i, \Gamma_i)$ and define the structural score for
these fragments by the formula

$$\delta(V_i, W_i, \Gamma_i) = \left| \Phi(V_i, \Gamma_i) - \Phi(W_i, \Gamma_i) \right|. \qquad (11)$$

The score $\delta(V_i, W_i, \Gamma_i)$ is low when the statistical potentials of the fragments $V_i$, $W_i$ in the
conformation $\Gamma_i$ are similar. These structural scores define contributions to the score functional (10)
of the alignment. Let us note that here instead of maximization we consider minimization of the
score of the alignment (i.e. the lower the score, the better).

Another case of the construction under consideration is the local alignment of segments in
sequences $V$ and $W$ without gaps. The alignment score takes the form

$$\sum_i \left| \Phi(V_i, \Gamma_i) - \Phi(W_i, \Gamma_i) \right|, \qquad (12)$$

i.e. we align segments of equal length without gaps in sequences $V$, $W$ and consider the corresponding
alignment of fragments $V_i$, $(W_i, \Gamma_i)$ (of length five) lying in the segments. The alignment score
will be good (low) when the statistical potentials of the aligned fragments in the same conformations
are similar.
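
A minimal sketch of such a gapless local search is given below. It assumes that the number of aligned fragments `n_frag` is fixed in advance (otherwise a sum of non-negative terms would always favor the shortest segment), that fragments are indexed by their start position rather than by their center, and that the potential is available as a callable `phi(fragment, conformation)`. All names are hypothetical.

```python
def best_gapless_segment(v, w, gamma_w, phi, n_frag, k=5):
    """Return (score, a, b) for the lowest-scoring gapless placement.

    gamma_w[j] is the conformation of the fragment of w starting at position j.
    a and b are the start indices of the aligned runs of fragments in v and w.
    Returns None when no run of n_frag fragments fits into both sequences.
    """
    frags_v = [v[i:i + k] for i in range(len(v) - k + 1)]
    frags_w = [w[j:j + k] for j in range(len(w) - k + 1)]
    best = None
    for a in range(len(frags_v) - n_frag + 1):
        for b in range(len(frags_w) - n_frag + 1):
            # Score of form (12): both fragments are evaluated in the conformation of w.
            score = sum(
                abs(phi(frags_v[a + t], gamma_w[b + t])
                    - phi(frags_w[b + t], gamma_w[b + t]))
                for t in range(n_frag)
            )
            if best is None or score < best[0]:
                best = (score, a, b)
    return best
```

This exhaustive scan is quadratic in the sequence lengths; it is meant only to make the definition of the score concrete.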

Both functionals (11), (12) are nonlocal with respect to the sequences $V$, $W$. Thus there are no
dynamic programming algorithms for finding extrema of these functionals. One can construct a
local alignment of short segments and then elongate these segments (as in the BLAST algorithm)
taking into account the structural alignment score.

8 Application to protein dynamics 

In the approach discussed above we put in correspondence to a protein, in addition to the primary
structure and the native conformation, two more sequences: a sequence $(I_i)$ of fragments and
a sequence $(\Gamma_i)$ of coarse grained conformations of fragments; we consider two sets of contacts
between fragments (non-specific and specific); and two functions $\Phi$ and $\Psi$ (frequencies of occurrence
of fragments and of specific contacts of fragments). We also consider the hierarchy of "basins" for
the function $\Phi$ and the set of statistical potentials for this hierarchy. We use this data to construct
the functional (8) of free energy and discuss application of this functional to the investigation of proteins.

The function $\Phi$ has the meaning of binding energy of a fragment. It was shown [TT], [U] that
a similar function identifies protein domains as segments between deep local minima of $-\Phi$ (taken as
a function of the fragment number $i$); in those papers only statistics of sequences of amino acids in fragments
was used and conformations were not taken into account. Parts of a protein with high values of
$-\Phi$ have high binding energy and therefore high rigidity, while segments with low rigidity separate the
domains. This approach can be used for description of mechanical properties of a protein globule.
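
The domain picture can be illustrated by the following sketch, which cuts a rigidity profile (the values of $-\Phi$ along the chain) at its deep local minima. Here "deep" is taken to mean "below a given threshold", which is an assumption of the example and not a criterion stated in the text; the names `split_domains` and `threshold` are hypothetical.

```python
def split_domains(rigidity, threshold):
    """Split fragment indices into domains separated by deep local minima.

    rigidity[i] is the value of -Phi at fragment i.
    Returns half-open index ranges (start, end); the cut positions
    themselves are left out as linkers between domains.
    """
    n = len(rigidity)
    cuts = [
        i for i in range(1, n - 1)
        # Strict on the left so that a flat minimum produces a single cut.
        if rigidity[i] < rigidity[i - 1]
        and rigidity[i] <= rigidity[i + 1]
        and rigidity[i] < threshold
    ]
    starts = [0] + [c + 1 for c in cuts]
    ends = cuts + [n]
    return [(s, e) for s, e in zip(starts, ends) if e > s]
```

For a profile with two deep minima the function returns three rigid segments, with the low-rigidity positions acting as boundaries between them.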

Expression (8) for free energy describes the energy in a fixed conformation (native state). For
description of relaxation dynamics at small distance one can assume that any fragment of the
protein has a fixed coarse grained conformation $\Gamma_i \in E_\varepsilon$ which does not change in the process of
dynamics. Some fragments may deform, but since the conformations $\Gamma_i \in E_\varepsilon$ are defined up to a small
perturbation $\varepsilon$, one can assume that the classes of conformations in $E_\varepsilon$ remain unchanged. Thus the coarse
grained description does not describe the dynamics.

Protein dynamics can be described by a model of a spring with variable elasticity. The spring
will describe the backbone of the peptide chain. Elasticity of the spring will be given by the
function $-\Phi$: the elasticity at the $i$-th residue will be equal to $-\Phi(I_i, \Gamma_i)$ (we may add a constant to
the function $-\Phi(I_i, \Gamma_i)$ to make the elasticity positive). One can also take into account adhesion of
fragments described by the function $\Psi$.
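
To make the picture concrete, here is a sketch of the simplest possible reading of this model: a one-dimensional chain of harmonic springs connected in series, with the stiffness of the $i$-th spring equal to $-\Phi(I_i, \Gamma_i)$ plus a constant shift. The reduction to one dimension, the neglect of adhesion, and all function names are assumptions of the example.

```python
def stiffness_profile(phi_values, shift=0.0):
    """Stiffness k_i = -Phi_i + shift; the shift must make every k_i positive."""
    ks = [-p + shift for p in phi_values]
    if any(k <= 0 for k in ks):
        raise ValueError("increase the shift so that all stiffnesses are positive")
    return ks


def extensions_under_force(ks, force):
    """Springs in series carry the same force, so spring i stretches by force / k_i."""
    return [force / k for k in ks]


def effective_stiffness(ks):
    """Stiffness of the whole chain: 1 / k_eff = sum_i 1 / k_i."""
    return 1.0 / sum(1.0 / k for k in ks)
```

In this toy chain the deformation concentrates in the soft segments (low values of $-\Phi$), while the rigid segments move almost as a whole.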

The function $-\Phi$ (considered as a function of the fragment number $i$) has the hierarchical form
described above: the graph of the function has the form of a hierarchy of elevations separated
by local minima of hierarchical depth. This observation may be compared with the concept of a
molecular machine as a realization of a hierarchical (fractal, or crumpled) globule [24], [25], see also
[26] for discussion of the relation of fractal globules, space filling curves and DNA conformation in
chromosomes. The approach discussed in the present paper differs from the hierarchical version
of the model of elastic networks |2^ considered in [24], [25]. First, we consider the model of a spring
with variable elasticity instead of the model of an elastic network. Second, the origin of hierarchy in
the model is the hierarchical structure of the function $\Phi$ instead of a hierarchical structure of the
protein globule.

The hierarchy of "basins" for the function $\Phi$ (branches $(\Gamma_i)_A$ of the tree $T(I, \Gamma)$ of basins) can be
considered as a description of the construction of the molecular machine: the deeper the branch,
the higher the energy of deformation of this branch, and one branch can move relative to another as
a whole.


9 Conclusion 

In the present paper we construct a model of free energy of proteins based on bioinformatics. 
Using statistics of short fragments in proteins, we build a family of statistical potentials which 
describe joint distribution of sequences of amino acids in the fragments and conformations of the 
fragments. 

The important feature of the model is that all the data (sequences and conformations) in the
model are discrete. In particular, a conformation of a protein is represented as a sequence of
conformations of fragments, which belong to a set of small size. This kind of representation
allows us to use data of bioinformatics (which usually have the form of sequences of some symbols)
on a par with physical data. The possibility of this representation is granted by the smallness of the
$\varepsilon$-entropy of the space of conformations of fragments (known from experiments).

In the model under consideration the free energy of a protein is a sum of statistical potentials
of short fragments of the protein, contacts of fragments, and a hierarchical family of longer fragments.
In general, this model is an example of a physical model of a complex system based on big data (in our
case on the data of bioinformatics).

We also discuss application of the considered model of statistical potentials for protein
fragments to structural alignment of proteins. A modification of the scoring functional for alignment
which takes into account statistical potentials for fragments is considered, and its application to
the threading procedure is discussed.